Each of the variables is a continuous numeric type. We have arrests per 100,000 people for three violent crimes: assault, murder, and rape. We also have a column giving the percentage of each state's population that lives in urban areas. Before proceeding with prediction, we note that tree-based techniques can be more unstable if the variables are too correlated with one another. We can also check for any extreme skew in the data.
library(GGally)
ggpairs(USArrests)
We do see some positive relationships and a few stronger correlations, but maybe not quite enough to get us in trouble.
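To put numbers on what the pairs plot suggests, a minimal sketch using only base R follows. It prints the pairwise correlation matrix and a rough skewness value for each column. The `skew` helper is a hypothetical convenience function written for this illustration (a simple moment-based estimate), not something from GGally or the original analysis.

```r
# Pairwise Pearson correlations among the four USArrests columns
corr <- cor(USArrests)
print(round(corr, 2))

# Hypothetical helper: moment-based sample skewness
# (close to 0 for roughly symmetric data, positive for a long right tail)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
skews <- sapply(USArrests, skew)
print(round(skews, 2))
```

Reading the off-diagonal entries of `corr` alongside the plot makes it easier to judge which pairs of predictors move together most strongly.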
Now let's try to predict Murder using the other features.
dt = rpart(Murder ~ ., data = USArrests)
rpart.plot(dt)
We can calculate a kind of R-squared measure of accuracy by squaring the correlation between the actual Murder values and our predicted ones.
USArrests %>%
  mutate(predicted_murder = predict(dt, USArrests)) %>%
  select(Murder, predicted_murder) %>%
  cor() -> corrmat
rsq = corrmat[["Murder", "predicted_murder"]]^2
print(paste("The r-square for our model is", round(rsq, 2), sep = ": "))
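As a cross-check on the squared-correlation approach, the sketch below computes the more familiar 1 - SSE/SST form of R-squared for the same tree. It refits the model so that it runs on its own, and it assumes the default rpart settings used above. Because a regression tree predicts the mean of each leaf, the two quantities should agree on the training data. Note that both are computed on the same rows used to fit the tree, so they likely overstate how well the model would do on new states.

```r
library(rpart)

# Refit the same tree as in the text so this block is self-contained
dt <- rpart(Murder ~ ., data = USArrests)
pred <- predict(dt, USArrests)

# Squared correlation between actual and predicted values
rsq_cor <- cor(USArrests$Murder, pred)^2

# Classic form: 1 - (residual sum of squares / total sum of squares)
sse <- sum((USArrests$Murder - pred)^2)
sst <- sum((USArrests$Murder - mean(USArrests$Murder))^2)
rsq_sse <- 1 - sse / sst

print(c(squared_correlation = rsq_cor, one_minus_sse_sst = rsq_sse))
```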